An In-Depth Analysis of Gun Violence in America¶

Will M, Ethan B, Zichao L¶

Part 1: Introduction¶

Gun violence has become a significant problem in America today. We are constantly reminded by news reports and social media that gun violence is a part of our lives - as a result, our lives are being disrupted by this threat. Schools are enforcing shooting drills, products like bulletproof vests are becoming ever more common, and our politics are being divided over what the right thing to do is.

In 2020, gun violence was the most common cause of death among people younger than 19. Between 1968 and 2011, an estimated 1.4 million Americans died from gun violence. The gun-related homicide rate in the United States is 25 times higher than in other developed countries. Because of these statistics, it makes sense that the general public be informed about this issue.

In this tutorial, we will do an in-depth analysis of the history, causes and effects of gun violence. The data we will be using can be found at https://github.com/jamesqo/gun-violence-data. The ultimate goal is to understand the factors that contribute the most to gun violence.

Part 2: Data¶

We will start by importing the necessary packages.

In [1]:
import pandas as pd
import numpy as np

The first thing we need to do is to read in our data. This can be done with pandas, and here is the result:

In [2]:
data = pd.read_csv("stage3.csv")
data.head()
Out[2]:
incident_id date state city_or_county address n_killed n_injured incident_url source_url incident_url_fields_missing ... participant_age participant_age_group participant_gender participant_name participant_relationship participant_status participant_type sources state_house_district state_senate_district
0 461105 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 http://www.gunviolencearchive.org/incident/461105 http://www.post-gazette.com/local/south/2013/0... False ... 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female 0::Julian Sims NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:... http://pittsburgh.cbslocal.com/2013/01/01/4-pe... NaN NaN
1 460726 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 http://www.gunviolencearchive.org/incident/460726 http://www.dailybulletin.com/article/zz/201301... False ... 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male 0::Bernard Gillis NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:... http://losangeles.cbslocal.com/2013/01/01/man-... 62.0 35.0
2 478855 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 http://www.gunviolencearchive.org/incident/478855 http://chronicle.northcoastnow.com/2013/02/14/... False ... 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male 0::Damien Bell||1::Desmen Noble||2::Herman Sea... NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic... http://www.morningjournal.com/general-news/201... 56.0 13.0
3 478925 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 http://www.gunviolencearchive.org/incident/478925 http://www.dailydemocrat.com/20130106/aurora-s... False ... 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male 0::Stacie Philbrook||1::Christopher Ratliffe||... NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su... http://denver.cbslocal.com/2013/01/06/officer-... 40.0 28.0
4 478959 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 http://www.gunviolencearchive.org/incident/478959 http://www.journalnow.com/news/local/article_d... False ... 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 0::Danielle Imani Jameison||1::Maurice Eugene ... 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su... http://myfox8.com/2013/01/08/update-mother-sho... 62.0 27.0

5 rows × 29 columns

This table is rather big, so we will need to do some cleaning and tidying before we can start our analysis.

Firstly, we won't need all the data in this table. According to the dataset, some of the columns are not required and thus may contain NaN values. We don't want this, as it will make our analysis more difficult than it needs to be. Out of the 29 columns, only 9 are required. That being said, we don't want to remove all of the unrequired columns, as some contain valuable information we will need. The columns we will be removing are those that are neither required by the dataset nor necessary for this analysis.

The following columns will be removed:

  • source_url
  • congressional_district
  • location_description
  • notes
  • participant_name
  • sources
  • state_house_district
  • state_senate_district

Here is the result:

In [3]:
columns_to_remove = [
    "source_url",
    "congressional_district",
    "location_description",
    "notes",
    "participant_name",
    "sources",
    "state_house_district",
    "state_senate_district",
]
data = data.drop(columns=columns_to_remove)
data.head()
Out[3]:
incident_id date state city_or_county address n_killed n_injured incident_url incident_url_fields_missing gun_stolen ... incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
0 461105 2013-01-01 Pennsylvania Mckeesport 1506 Versailles Avenue and Coursin Street 0 4 http://www.gunviolencearchive.org/incident/461105 False NaN ... Shot - Wounded/Injured||Mass Shooting (4+ vict... 40.3467 -79.8559 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:...
1 460726 2013-01-01 California Hawthorne 13500 block of Cerise Avenue 1 3 http://www.gunviolencearchive.org/incident/460726 False NaN ... Shot - Wounded/Injured||Shot - Dead (murder, a... 33.9090 -118.3330 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:...
2 478855 2013-01-01 Ohio Lorain 1776 East 28th Street 1 3 http://www.gunviolencearchive.org/incident/478855 False 0::Unknown||1::Unknown ... Shot - Wounded/Injured||Shot - Dead (murder, a... 41.4455 -82.1377 2.0 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic...
3 478925 2013-01-05 Colorado Aurora 16000 block of East Ithaca Place 4 0 http://www.gunviolencearchive.org/incident/478925 False NaN ... Shot - Dead (murder, accidental, suicide)||Off... 39.6518 -104.8020 NaN 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...
4 478959 2013-01-07 North Carolina Greensboro 307 Mourning Dove Terrace 2 2 http://www.gunviolencearchive.org/incident/478959 False 0::Unknown||1::Unknown ... Shot - Wounded/Injured||Shot - Dead (murder, a... 36.1140 -79.9569 2.0 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...

5 rows × 21 columns

Secondly, we need to remove columns that are well-formed but are either unnecessary or contain sensitive information, like an address. We want this analysis to remain as anonymous as possible, and we want to respect those who were affected by these incidents.

We will handle NaN values on a per-situation basis. Pandas helps here with functions like isnull(), which flags each missing value, and dropna(), which removes rows that are missing a value in the columns we choose. With these, we can continue our analysis without much trouble.
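As a minimal sketch of that per-situation handling, the toy frame below (its values are made up for illustration and are not rows from the dataset) shows how isnull() can be used to count missing cells per column, and how dropna(subset=...) drops rows only where one specific column is missing.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from the real dataset
toy = pd.DataFrame(
    {
        "n_killed": [0, 1, 2],
        "gun_type": [np.nan, "0::Handgun", np.nan],
        "participant_age": ["0::20", np.nan, "0::18||1::46"],
    }
)

# isnull() returns a boolean frame: True wherever a value is missing
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop rows only where the column needed for a given analysis is missing
with_gun_type = toy.dropna(subset=["gun_type"])
print(len(with_gun_type))
```

Dropping per column rather than across the whole table keeps as many incidents as possible for each individual graph.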

Here is the final result, and the data we will be using in the rest of the analysis:

In [4]:
labels = ["address", "incident_url", "incident_url_fields_missing"]
data = data.drop(columns=labels)
data.head()
Out[4]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
0 461105 2013-01-01 Pennsylvania Mckeesport 0 4 NaN NaN Shot - Wounded/Injured||Mass Shooting (4+ vict... 40.3467 -79.8559 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||3::Male||4::Female NaN 0::Arrested||1::Injured||2::Injured||3::Injure... 0::Victim||1::Victim||2::Victim||3::Victim||4:...
1 460726 2013-01-01 California Hawthorne 1 3 NaN NaN Shot - Wounded/Injured||Shot - Dead (murder, a... 33.9090 -118.3330 NaN 0::20 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male NaN 0::Killed||1::Injured||2::Injured||3::Injured 0::Victim||1::Victim||2::Victim||3::Victim||4:...
2 478855 2013-01-01 Ohio Lorain 1 3 0::Unknown||1::Unknown 0::Unknown||1::Unknown Shot - Wounded/Injured||Shot - Dead (murder, a... 41.4455 -82.1377 2.0 0::25||1::31||2::33||3::34||4::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Male||1::Male||2::Male||3::Male||4::Male NaN 0::Injured, Unharmed, Arrested||1::Unharmed, A... 0::Subject-Suspect||1::Subject-Suspect||2::Vic...
3 478925 2013-01-05 Colorado Aurora 4 0 NaN NaN Shot - Dead (murder, accidental, suicide)||Off... 39.6518 -104.8020 NaN 0::29||1::33||2::56||3::33 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A... 0::Female||1::Male||2::Male||3::Male NaN 0::Killed||1::Killed||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...
4 478959 2013-01-07 North Carolina Greensboro 2 2 0::Unknown||1::Unknown 0::Handgun||1::Handgun Shot - Wounded/Injured||Shot - Dead (murder, a... 36.1140 -79.9569 2.0 0::18||1::46||2::14||3::47 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::... 0::Female||1::Male||2::Male||3::Female 3::Family 0::Injured||1::Injured||2::Killed||3::Killed 0::Victim||1::Victim||2::Victim||3::Subject-Su...

Data collection began in 2013, so the records for that year are not exhaustive (as stated in the dataset) and do not give an accurate representation of the year. For this reason, we decided to remove them.

In [5]:
# Drop every incident whose date falls in 2013
data = data[~data["date"].str.startswith("2013")]

Now we will convert date to datetime so we can use it later.

In [6]:
data["date"] = pd.to_datetime(data["date"])
data.head()
Out[6]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude longitude n_guns_involved participant_age participant_age_group participant_gender participant_relationship participant_status participant_type
278 95289 2014-01-01 Michigan Muskegon 0 0 NaN NaN Shots Fired - No Injuries 43.2301 -86.2514 NaN NaN 0::Adult 18+ 0::Female NaN 0::Unharmed 0::Victim
279 92401 2014-01-01 New Jersey Newark 0 0 NaN NaN Officer Involved Incident 40.7417 -74.1695 NaN NaN NaN NaN NaN NaN NaN
280 92383 2014-01-01 New York Queens 1 0 NaN NaN Shot - Dead (murder, accidental, suicide) 40.7034 -73.7474 NaN 0::22||1::26 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Killed||1::Unharmed 0::Victim||1::Subject-Suspect
281 92142 2014-01-01 New York Brooklyn 0 1 NaN NaN Shot - Wounded/Injured 40.6715 -73.9476 NaN 0::34 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Injured 0::Victim||1::Subject-Suspect
282 95261 2014-01-01 Missouri Springfield 0 1 NaN NaN Shot - Wounded/Injured 37.2646 -93.3007 NaN 0::6||1::12 0::Child 0-11||1::Teen 12-17 0::Female NaN 0::Injured||1::Unharmed 0::Victim||1::Subject-Suspect

Now we will create columns for each part of the date.

In [7]:
data["year"] = data["date"].dt.year
data["month"] = data["date"].dt.month
data["day"] = data["date"].dt.day
data["month_year"] = data["date"].dt.to_period("M")
data.head()
Out[7]:
incident_id date state city_or_county n_killed n_injured gun_stolen gun_type incident_characteristics latitude ... participant_age participant_age_group participant_gender participant_relationship participant_status participant_type year month day month_year
278 95289 2014-01-01 Michigan Muskegon 0 0 NaN NaN Shots Fired - No Injuries 43.2301 ... NaN 0::Adult 18+ 0::Female NaN 0::Unharmed 0::Victim 2014 1 1 2014-01
279 92401 2014-01-01 New Jersey Newark 0 0 NaN NaN Officer Involved Incident 40.7417 ... NaN NaN NaN NaN NaN NaN 2014 1 1 2014-01
280 92383 2014-01-01 New York Queens 1 0 NaN NaN Shot - Dead (murder, accidental, suicide) 40.7034 ... 0::22||1::26 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Killed||1::Unharmed 0::Victim||1::Subject-Suspect 2014 1 1 2014-01
281 92142 2014-01-01 New York Brooklyn 0 1 NaN NaN Shot - Wounded/Injured 40.6715 ... 0::34 0::Adult 18+||1::Adult 18+ 0::Male||1::Male NaN 0::Injured 0::Victim||1::Subject-Suspect 2014 1 1 2014-01
282 95261 2014-01-01 Missouri Springfield 0 1 NaN NaN Shot - Wounded/Injured 37.2646 ... 0::6||1::12 0::Child 0-11||1::Teen 12-17 0::Female NaN 0::Injured||1::Unharmed 0::Victim||1::Subject-Suspect 2014 1 1 2014-01

5 rows × 22 columns

Now that our data has been cleaned up, it's time to explain what we are looking at. This dataset tracks every recorded incident of gun violence between early 2013 and early 2018 in the United States. It contains all the critical information we need to understand each incident that occurred, such as where and when it happened, who was involved, and what the outcome was. Below is a summary of each column and what it tells us about the incident.

  • date: when the incident occurred
  • state: what state the incident occurred in
  • city_or_county: what city or county the incident occurred in
  • n_killed: how many people were killed in the incident
  • n_injured: how many people were injured in the incident
  • gun_stolen: whether or not the gun/guns used were stolen
  • gun_type: what type of gun/guns were used
  • incident_characteristics: specific details about the incident
  • latitude: geographic latitude of the incident
  • longitude: geographic longitude of the incident
  • n_guns_involved: how many guns were involved in the incident
  • participant_age: a breakdown of each participant's age
  • participant_age_group: a breakdown of each participant's age group
  • participant_gender: a breakdown of each participant's gender
  • participant_relationship: a breakdown of each participant's relationship to other participants
  • participant_status: a breakdown of the outcome of each participant
  • participant_type: a breakdown of each participant's role in the incident
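The participant_* columns pack one value per participant into a single string: entries are separated by || and each entry is written as index::value. A minimal sketch of unpacking such a string follows; parse_participants is a hypothetical helper name chosen for illustration, not part of the dataset or of pandas.

```python
def parse_participants(field):
    """Turn '0::Male||1::Female' into {0: 'Male', 1: 'Female'}."""
    result = {}
    for entry in field.split("||"):
        index, sep, value = entry.partition("::")
        # Skip entries that do not follow the index::value pattern
        if sep and index.isdigit():
            result[int(index)] = value
    return result


genders = parse_participants("0::Male||1::Male||3::Male||4::Female")
print(genders)
```

Keying by participant index matters because a field can skip a participant, as the gender string above does for participant 2.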

Part 3: Analysis¶

Graphs¶

To begin our analysis, we want to get a good understanding of the data.

In [8]:
import matplotlib.pyplot as plt
import seaborn as sns

Distribution of Fatalities in Mass Shootings¶

In [9]:
# Count incidents by number of fatalities, keeping only those with 4 or more deaths
frequencies = data.loc[data["n_killed"] >= 4, "n_killed"].value_counts().to_dict()
In [10]:
plt.bar(frequencies.keys(), frequencies.values(), width=0.7)
plt.xlim([0, 27])
plt.xlabel("Number of Fatalities")
plt.ylabel("Frequency")
plt.title("Distribution of Fatalities in Mass Shootings")
plt.show()
In [11]:
k_freq = data["n_killed"].value_counts(normalize=True).iloc[0:6]
sns.barplot(x=k_freq.index, y=k_freq.values)
Out[11]:
<AxesSubplot: >

Frequency of Different Gun Types Used in Shootings¶

In [12]:
gun_types = {"Handgun": 0, "Rifle": 0, "Shotgun": 0}
gun_type_df = data.dropna(subset=['gun_type'])
for _, row in gun_type_df.iterrows():
    gun_types["Handgun"] += row["gun_type"].count("Handgun")
    gun_types["Rifle"] += row["gun_type"].count("Rifle")
    gun_types["Shotgun"] += row["gun_type"].count("Shotgun")
In [13]:
plt.bar(gun_types.keys(), gun_types.values())
plt.xlabel("Gun Type")
plt.ylabel("Frequency")
plt.title("Frequency of Different Gun Types Used in Shootings")
plt.show()

Male Versus Female Involvement in Gun Violence¶

In [14]:
male_vs_female = {"Child 0-11": [0, 0], "Teen 12-17": [0, 0], "Adult 18+": [0, 0]}

gender_age_df = data.dropna(subset=["participant_gender", "participant_age_group"])

def parse_field(field):
    # Turn "0::Male||1::Female" into {"0": "Male", "1": "Female"}
    return dict(e.split("::", 1) for e in field.split("||") if "::" in e)


for _, row in gender_age_df.iterrows():
    genders = parse_field(row["participant_gender"])
    age_groups = parse_field(row["participant_age_group"])
    # Match participants by index, since either field may skip a participant
    for idx, gender in genders.items():
        age_group = age_groups.get(idx)
        if gender in ("Male", "Female") and age_group in male_vs_female:
            # Position 0 counts males, position 1 counts females
            male_vs_female[age_group][0 if gender == "Male" else 1] += 1
In [15]:
labels = ["Adult", "Teen", "Child"]
male_data = [
    male_vs_female["Adult 18+"][0],
    male_vs_female["Teen 12-17"][0],
    male_vs_female["Child 0-11"][0],
]
female_data = [
    male_vs_female["Adult 18+"][1],
    male_vs_female["Teen 12-17"][1],
    male_vs_female["Child 0-11"][1],
]

x_axis = np.arange(len(labels))
width = 0.35

fig, ax = plt.subplots()
fig.set_figwidth(10)
fig.set_figheight(8)
rects1 = ax.bar(x_axis - width / 2, male_data, width, label="Male")
rects2 = ax.bar(x_axis + width / 2, female_data, width, label="Female")

ax.set_xlabel("Age Group")
ax.set_ylabel("Amount of Involvement")
ax.set_ylim([0, 275000])
ax.set_title("Male Versus Female Involvement in Gun Violence")
ax.set_xticks(x_axis, labels)
ax.legend()
ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)
plt.show()

Mean Age of Participants Between 15 and 75 Versus Lethality¶

Lethality is calculated using the following formula:

$ \text{Lethality} = 2 \times \text{Participants Killed} + 1.5 \times \text{Participants Injured} $
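As a quick worked example, an incident with 1 person killed and 3 injured scores 2 × 1 + 1.5 × 3 = 6.5. The sketch below applies the same weighting to whole columns at once; the casualty counts are made up, and the lethality column name is our own choice for illustration.

```python
import pandas as pd

# Made-up casualty counts for illustration
toy = pd.DataFrame({"n_killed": [0, 1, 4], "n_injured": [4, 3, 0]})

# Weighted sum: each death counts 2 points, each injury 1.5 points
toy["lethality"] = 2 * toy["n_killed"] + 1.5 * toy["n_injured"]
print(toy)
```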

In [16]:
def mean_age_of_participants(field):
    # Parse "0::25||1::31" into ages, keeping only those from 15 up to 75
    ages = []
    for entry in field.split("||"):
        age = entry.split("::")[-1]
        if age.isdigit() and 15 <= int(age) < 75:
            ages.append(int(age))
    if not ages:
        return "Invalid"
    # Divide by the number of participants, not the number of distinct ages
    return sum(ages) / len(ages)


age_df = data.dropna(subset=["participant_age"])
raw, filtered = [], []
for _, row in age_df.iterrows():
    [mean_age, lethality] = mean_age_of_participants(row["participant_age"]), float(
        ((2 * row["n_killed"]) + (1.5 * row["n_injured"]))
    )
    raw.append([mean_age, lethality])
for entry in raw:
    if entry[0] != "Invalid":
        filtered.append(entry)
In [17]:
x_data, y_data = [], []
for entry in filtered:
    if entry[0] < 75 and entry[1] < 100:
        x_data.append(entry[0])
        y_data.append(entry[1])
[slope, intercept] = np.polyfit(x_data, y_data, 1)
plt.figure(figsize=(10, 8))
plt.scatter(x_data, y_data, s=30, edgecolor="black")
plt.xlabel("Mean Age of Participants")
plt.ylabel("Measure of Lethality")
plt.title(
    "Mean Age of Participants Between 15 and 75 Years Old In Shootings Versus Lethality"
)
plt.plot(np.asarray(x_data), slope * np.asarray(x_data) + intercept, color="orange")
plt.show()

Maps¶

One good way to view this dataset is by generating a map. To do this, we first get a GeoJSON file containing the relevant information for each state. Then we count the incidents in each state and merge those counts into the state geometries, so that both can be plotted together.

In [18]:
# Getting GeoJson of US states from the folium and saving as geopandas(so we can add GeoJson tooltips)
# Source: https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json
import geopandas as gpd

state_geo = gpd.read_file("data/us-states.json")
In [19]:
# Summing up incidents per state
incident_count = data["state"].value_counts().reset_index()
incident_count.columns = ["name", "count"]
# Then merging since folium only does one data source for GeoJson
state_geo_count = state_geo.merge(incident_count, on="name")

Now that we have a valid dataframe, we need to create our maps.

In [20]:
from folium import Map, Choropleth
from folium.features import GeoJson, GeoJsonTooltip

total_shootings_by_state_map = Map(location=[43, -102], zoom_start=4)

Choropleth(
    geo_data=state_geo,
    data=incident_count,
    bins=9,
    columns=["name", "count"],
    key_on="feature.properties.name",
    legend_name="Total shootings in state from 2014-2018",
    fill_color="YlOrRd",
    fill_opacity=0.7,
    line_opacity=0.5,
    reset=True,
).add_to(total_shootings_by_state_map)
Out[20]:
<folium.features.Choropleth at 0x1a860bfd0>

That last cell made a map, then created a choropleth layer using the state geojson and counts.

In [21]:
style = lambda x: {
    "fillColor": "#ffffff",
    "color": "#000000",
    "fillOpacity": 0.1,
    "weight": 0.1,
}

highlight = lambda x: {
    "fillColor": "#000000",
    "color": "#000000",
    "fillOpacity": 0.30,
    "weight": 0.1,
}

gjson = GeoJson(
    data=state_geo_count,
    style_function=style,
    highlight_function=highlight,
    control=False,
    tooltip=GeoJsonTooltip(
        fields=["name", "count"],
        aliases=["State", "Shootings"],
    ),
)
total_shootings_by_state_map.add_child(gjson)
total_shootings_by_state_map.keep_in_front(gjson)

We create three things in this cell. First come the style and highlight functions, which specify the colors and opacity of a state in its normal and hovered appearance. Then comes the GeoJson object, which applies those functions and adds a tooltip (or popup) with the relevant info when you hover over a state.

In [22]:
# Showing the map
total_shootings_by_state_map

Now we will make a time-based heatmap.
First, we must build the time index and group all latitude/longitude pairs that occurred within each month.
This groupby approach looks complicated, but it ran faster for us than an explicit for loop.
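To make the target shape concrete: folium's HeatMapWithTime takes one list of [latitude, longitude] pairs per time step. The small sketch below builds that nested list from made-up coordinates, assuming a month_year column like the one created earlier.

```python
import pandas as pd

# Made-up coordinates for illustration
toy = pd.DataFrame(
    {
        "month_year": ["2014-01", "2014-01", "2014-02"],
        "latitude": [43.23, 40.74, 40.70],
        "longitude": [-86.25, -74.17, -73.75],
    }
)

# One inner list of [lat, lon] pairs for each month, in sorted month order
pairs_by_month = [
    group[["latitude", "longitude"]].values.tolist()
    for _, group in toy.groupby("month_year")
]
print(pairs_by_month)
```

The outer list must line up with the time index, which is why both are ordered by month.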

In [23]:
heatmap_df = data.dropna(subset=["month_year", "latitude", "longitude"])
heat_data = (
    heatmap_df[["month_year", "latitude", "longitude"]]
    .groupby("month_year")
    .apply(lambda row: [list(tup) for tup in zip(row["latitude"], row["longitude"])])
    .tolist()
)
In [24]:
time_index = list(heatmap_df["month_year"].astype("str").sort_values().unique())

Now we must make our actual map.

In [25]:
from folium.plugins import HeatMapWithTime

heatmap = Map(location=[43, -102], zoom_start=4)

HeatMapWithTime(
    heat_data,
    index=time_index,
    radius=10,
    auto_play=False,
    speed_step=1,
    min_speed=1,
).add_to(heatmap)
Out[25]:
<folium.plugins.heat_map_withtime.HeatMapWithTime at 0x1ada56800>

MAP:¶

(Zoom in to see specific areas)

In [26]:
heatmap

Another interesting question is how the 2016 presidential election affected gun violence.

First, let's count the incidents per month so that we can pull out 2015-2017.

In [27]:
cpm = data["month_year"].value_counts().sort_index().to_frame()
cpm.columns = ["count"]
cpm["year"] = cpm.index.year
cpm["month"] = cpm.index.month
# months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
cpm.head()
Out[27]:
count year month
2014-01 4395 2014 1
2014-02 3045 2014 2
2014-03 3669 2014 3
2014-04 3891 2014 4
2014-05 4320 2014 5

Visualizing this gives us the following:

In [28]:
sns.set_theme()
election_df = cpm.query("year >= 2015 and year <= 2017")
sns.relplot(election_df, x="month", y="count", col="year", kind="line")
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x10affcfd0>

Overlaid:¶

In [29]:
sns.lineplot(election_df, x="month", y="count", hue="year")
Out[29]:
<AxesSubplot: xlabel='month', ylabel='count'>

Gun violence does appear to spike starting around November 2016. However, the overlay shows that a similar pattern seems to occur every year, so it is unlikely to be specific to the election. What does seem to change is the overall level, which rises from year to year. To confirm this, we can plot all available years.

In [30]:
sns.lineplot(cpm, x="month", y="count", hue="year")
Out[30]:
<AxesSubplot: xlabel='month', ylabel='count'>

We could also average the months over the recorded years to see this:

In [31]:
mc = cpm.groupby("month")["count"].mean()
sns.lineplot(x=mc.index, y=mc.values)
Out[31]:
<AxesSubplot: xlabel='month'>

The years clearly all follow the same trends. The only apparent difference is that the volume of gun violence increases each year. To confirm this, we can sum the counts for each whole year and compare.

In [32]:
yc = cpm.groupby("year")["count"].sum()
yc
Out[32]:
year
2014    51854
2015    53579
2016    58763
2017    61401
2018    13802
Name: count, dtype: int64

As we can see, 2018 only has a few months of recorded data in the set, so we should remove it when comparing years.

In [33]:
yc = yc.drop(index=yc.index[-1:])
In [34]:
sns.barplot(x=yc.index, y=yc.values)
Out[34]:
<AxesSubplot: xlabel='year'>

So overall, the number of recorded gun violence incidents rose every year from 2014 to 2017.

In [35]:
cpd = data["date"].value_counts().sort_index().to_frame().copy()
cpd.index = pd.to_datetime(cpd.index)

cpd.columns = ["count"]
cpd["year"] = cpd.index.year
cpd["month"] = cpd.index.month
cpd["day"] = cpd.index.day
cpd.head()
Out[35]:
count year month day
2014-01-01 216 2014 1 1
2014-01-02 119 2014 1 2
2014-01-03 124 2014 1 3
2014-01-04 140 2014 1 4
2014-01-05 130 2014 1 5
In [36]:
from sklearn.linear_model import LinearRegression

train_df = cpd.query("year < 2018").copy()

X_train = train_df.iloc[:, 1:4].values
y_train = train_df.iloc[:, 0].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict all training rows in one vectorized call
train_df["pred"] = reg.predict(X_train).ravel()
train_df = train_df.drop(columns=["year", "day", "month"])
reg.score(X_train, y_train)
Out[36]:
0.18705909256762032
In [37]:
plt.xticks(rotation=45)
sns.lineplot(data=train_df)
Out[37]:
<AxesSubplot: >
In [38]:
test_df = cpd.query("year == 2018").copy()
X_test = test_df.iloc[:, 1:4].values
y_test = test_df.iloc[:, 0].values.reshape(-1, 1)
# Predict all test rows in one vectorized call
test_df["pred"] = reg.predict(X_test).ravel()
test_df = test_df.drop(columns=["year", "day", "month"])
reg.score(X_test, y_test)
Out[38]:
-0.8194541412845129
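A negative score can look surprising. The score method of LinearRegression returns the coefficient of determination, R² = 1 − SS_res / SS_tot, which drops below zero whenever the predictions fit the test data worse than a constant prediction of the test mean would. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([100.0, 120.0, 140.0])
y_pred = np.array([150.0, 150.0, 150.0])  # consistently too high

ss_res = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
```

Here the residual error is more than four times the spread of the true values, so the score is well below zero.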
In [39]:
plt.xticks(rotation=45)
sns.lineplot(data=test_df)
Out[39]:
<AxesSubplot: >
In [40]:
from sklearn.preprocessing import PolynomialFeatures

train_df = cpd.query("year < 2018").copy()

X_train = train_df.iloc[:, 1:4].values
y_train = train_df.iloc[:, 0].values.reshape(-1, 1)
poly = PolynomialFeatures(5)
poly_X_train = poly.fit_transform(X_train)

clf = LinearRegression()
clf.fit(poly_X_train, y_train)

# Predict all training rows in one vectorized call
train_df["pred"] = clf.predict(poly_X_train).ravel()
train_df = train_df.drop(columns=["year", "day", "month"])
plt.xticks(rotation=45)
sns.lineplot(data=train_df)
Out[40]:
<AxesSubplot: >
In [41]:
test_df = cpd.query("year == 2018").copy()
X_test = test_df.iloc[:, 1:4].values
y_test = test_df.iloc[:, 0].values.reshape(-1, 1)

# Reuse the transformer fitted on the training data instead of refitting it
poly_X_test = poly.transform(X_test)

# Predict all test rows in one vectorized call
test_df["pred"] = clf.predict(poly_X_test).ravel()
test_df = test_df.drop(columns=["year", "day", "month"])
clf.score(poly_X_test, y_test)
Out[41]:
-1.4608151418296078
In [42]:
plt.xticks(rotation=45)
sns.lineplot(data=test_df)
Out[42]:
<AxesSubplot: >